Scalable epidemic message passing interface fault tolerance

Abstract

Resilience and fault tolerance are challenging tasks in high performance computing (HPC) and extreme-scale systems. Components fail more often in such systems, which results in application aborts. Adopting fault-tolerance techniques makes it possible to detect failures consistently and to continue the application's execution even if failures exist. A prominent parallel programming specification, the message passing interface (MPI), is used to implement the failure detection and consensus algorithm in this paper. Although MPI does not facilitate fault-tolerant behavior, this work presents a fault-tolerant, matrix-based failure detection and consensus algorithm. The proposed algorithm uses gossiping. To detect failures, randomised pinging is applied during execution using piggybacked gossip messages. To achieve consensus on failures in the system, information about failed processes is sent in the same piggybacked gossip messages to all alive processes. The algorithm was implemented in the MPI framework and is completely fault tolerant. The results show that failed processes were detected and that global consensus was achieved in the system.
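The abstract gives only an outline of the mechanism (randomised pinging, piggybacked gossip, a matrix of failure information), so the following is a minimal sketch of that general idea rather than the paper's algorithm. It is a single-threaded Python simulation, not MPI code. The function name `simulate_gossip`, the layout of the boolean matrix, the OR-merge rule, and the rule that a receiver adopts any suspicion it hears about are all assumptions made for illustration.

```python
import random


def simulate_gossip(n, failed, seed=0, max_rounds=1000):
    """Toy, single-threaded simulation of gossip-based failure detection.

    Returns (rounds, views), where views[p][i][j] is True when alive
    process p believes that process i suspects process j of having failed.
    All names and merge rules here are illustrative assumptions.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    # One n x n boolean matrix per alive process.
    views = {p: [[False] * n for _ in range(n)] for p in alive}

    def consensus_reached():
        # Every alive process knows that every alive process
        # suspects every failed process.
        return all(views[p][i][j]
                   for p in alive for i in alive for j in failed)

    for rounds in range(1, max_rounds + 1):
        for p in alive:
            # Randomised ping: pick any other rank uniformly at random.
            target = rng.choice([q for q in range(n) if q != p])
            if target in failed:
                # The ping goes unanswered, so p records its own suspicion.
                views[p][p][target] = True
                continue
            # The ping carries p's matrix (the piggybacked gossip);
            # the receiver merges it into its own with a logical OR.
            mine, theirs = views[p], views[target]
            for i in range(n):
                for j in range(n):
                    theirs[i][j] = theirs[i][j] or mine[i][j]
            # Assumed rule: the receiver adopts any suspicion it hears about.
            for j in range(n):
                if any(theirs[i][j] for i in range(n)):
                    theirs[target][j] = True
        if consensus_reached():
            return rounds, views
    raise RuntimeError("no consensus within max_rounds")


if __name__ == "__main__":
    rounds, views = simulate_gossip(n=8, failed={2, 5}, seed=1)
    print(f"consensus after {rounds} gossip rounds")
```

In this sketch a process treats consensus on a failed rank as reached once every alive row of its own matrix marks that rank. A real MPI implementation would additionally need timeouts for unanswered pings and non-blocking communication, neither of which is modelled here.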

Related resources

Fault Tolerance in Message Passing Interface Programs

In this paper we examine the topic of writing fault-tolerant Message Passing Interface (MPI) applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude ...

RADIC-based Message Passing Fault Tolerance System

We present an analysis and design of how to incorporate a transparent fault tolerance system at the socket level for message passing applications. The novel design changes the default socket model so that sockets are not unexpectedly closed due to a remote node failure. Moreover, a pessimistic log-based rollback recovery protocol added at this level makes it possible to restart and re-execute a failed parallel p...

Recent Results on Fault-Tolerance Consensus in Message-Passing Networks

This paper surveys recent results on fault-tolerant consensus in message-passing networks. We focus on two categories of work: (i) new problem formulations (including input domain, fault model, network model, etc.), and (ii) practical applications. For the second part, we focus on Crash Fault-Tolerant (CFT) systems that use Paxos or Raft, and Byzantine Fault-Tolerant (BFT) systems. We also br...

MPI: A Message Passing Interface

The MPI Forum. This paper presents an overview of MPI, a proposed standard message passing interface for MIMD distributed memory concurrent computers. The design of MPI has been a collective effort involving researchers in the United States and Europe from many organizations and institutions. MPI includes point-to-point and collective communication routines, as well as support for process groups,...

Journal

Journal title: Bulletin of Electrical Engineering and Informatics

Year: 2022

ISSN: 2302-9285

DOI: https://doi.org/10.11591/eei.v11i2.3374